This project is aims to:
Analyze the World cup matches data from 1930 to 2014. The dataset used for analysis is “WorldCupMatches.csv”.
Visualization of Model fitting on Heart disease diagnostic data and its coefficient plot.
The main objective of this projects are to:
The project starts by loading the necessary libraries. These libraries provide functions for data manipulation and visualization, respectively.
# Load libraries
library(tidyverse)
library(lubridate)
library(sf)
library(data.table)
library(fastDummies)
library(dotwhisker)
library(ROCR)
#Load the data
worldcup_matches <- read_csv("data/WorldCupMatches.csv")
str(worldcup_matches)
spc_tbl_ [4,572 × 20] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ Year : num [1:4572] 1930 1930 1930 1930 1930 1930 1930 1930 1930 1930 ...
$ Datetime : chr [1:4572] "13 Jul 1930 - 15:00" "13 Jul 1930 - 15:00" "14 Jul 1930 - 12:45" "14 Jul 1930 - 14:50" ...
$ Stage : chr [1:4572] "Group 1" "Group 4" "Group 2" "Group 3" ...
$ Stadium : chr [1:4572] "Pocitos" "Parque Central" "Parque Central" "Pocitos" ...
$ City : chr [1:4572] "Montevideo" "Montevideo" "Montevideo" "Montevideo" ...
$ Home Team Name : chr [1:4572] "France" "USA" "Yugoslavia" "Romania" ...
$ Home Team Goals : num [1:4572] 4 3 2 3 1 3 4 3 1 1 ...
$ Away Team Goals : num [1:4572] 1 0 1 1 0 0 0 0 0 0 ...
$ Away Team Name : chr [1:4572] "Mexico" "Belgium" "Brazil" "Peru" ...
$ Win conditions : chr [1:4572] NA NA NA NA ...
$ Attendance : num [1:4572] 4444 18346 24059 2549 23409 ...
$ Half-time Home Goals: num [1:4572] 3 2 2 1 0 1 0 2 0 0 ...
$ Half-time Away Goals: num [1:4572] 0 0 0 0 0 0 0 0 0 0 ...
$ Referee : chr [1:4572] "LOMBARDI Domingo (URU)" "MACIAS Jose (ARG)" "TEJADA Anibal (URU)" "WARNKEN Alberto (CHI)" ...
$ Assistant 1 : chr [1:4572] "CRISTOPHE Henry (BEL)" "MATEUCCI Francisco (URU)" "VALLARINO Ricardo (URU)" "LANGENUS Jean (BEL)" ...
$ Assistant 2 : chr [1:4572] "REGO Gilberto (BRA)" "WARNKEN Alberto (CHI)" "BALWAY Thomas (FRA)" "MATEUCCI Francisco (URU)" ...
$ RoundID : num [1:4572] 201 201 201 201 201 201 201 201 201 201 ...
$ MatchID : num [1:4572] 1096 1090 1093 1098 1085 ...
$ Home Team Initials : chr [1:4572] "FRA" "USA" "YUG" "ROU" ...
$ Away Team Initials : chr [1:4572] "MEX" "BEL" "BRA" "PER" ...
- attr(*, "spec")=
.. cols(
.. Year = col_double(),
.. Datetime = col_character(),
.. Stage = col_character(),
.. Stadium = col_character(),
.. City = col_character(),
.. `Home Team Name` = col_character(),
.. `Home Team Goals` = col_double(),
.. `Away Team Goals` = col_double(),
.. `Away Team Name` = col_character(),
.. `Win conditions` = col_character(),
.. Attendance = col_double(),
.. `Half-time Home Goals` = col_double(),
.. `Half-time Away Goals` = col_double(),
.. Referee = col_character(),
.. `Assistant 1` = col_character(),
.. `Assistant 2` = col_character(),
.. RoundID = col_double(),
.. MatchID = col_double(),
.. `Home Team Initials` = col_character(),
.. `Away Team Initials` = col_character()
.. )
- attr(*, "problems")=<externalptr>
The data cleaning and processing steps include the following:
Empty rows has been removed from the dataset
# Remove rows with empty values
worldcup_matches <- worldcup_matches[!is.na(worldcup_matches$Year), ]
Converted Datetime column into Date column and Time column with appropriate format
# Convert the datetime column to a POSIXct object
worldcup_matches$Datetime <- as.POSIXct(worldcup_matches$Datetime, format = "%d %b %Y - %H:%M", tz = "UTC")
# Separate the datetime column into date and time
worldcup_matches$Date <- as.Date(worldcup_matches$Datetime)
worldcup_matches$Time <- format(worldcup_matches$Datetime, format = "%H:%M")
Renaming specific column names for clarity
# Assuming you want to rename the "Win conditions" column to "Outcome"
colnames(worldcup_matches)[6] <- "home_team_name"
colnames(worldcup_matches)[7] <- "home_team_goals"
colnames(worldcup_matches)[8] <- "away_team_goals"
colnames(worldcup_matches)[9] <- "away_team_name"
colnames(worldcup_matches)[10] <- "win_conditions"
colnames(worldcup_matches)[12] <- "half_time_home_goals"
colnames(worldcup_matches)[13] <- "half_time_away_goals"
colnames(worldcup_matches)[15] <- "assistant1"
colnames(worldcup_matches)[16] <- "assistant2"
colnames(worldcup_matches)[19] <- "home_team_initials"
colnames(worldcup_matches)[20] <- "away_team_initials"
colnames(worldcup_matches)
[1] "Year" "Datetime" "Stage" "Stadium" "City" "home_team_name" "home_team_goals" "away_team_goals"
[9] "away_team_name" "win_conditions" "Attendance" "half_time_home_goals" "half_time_away_goals" "Referee" "assistant1" "assistant2"
[17] "RoundID" "MatchID" "home_team_initials" "away_team_initials" "Date" "Time"
Creating Goals column by adding home team goals and away team goals
# Create the "Goals" column
worldcup_matches$Goals <- worldcup_matches$home_team_goals + worldcup_matches$away_team_goals
# Create the "Match Outcome" column
worldcup_matches$outcome <- ifelse(worldcup_matches$home_team_goals > worldcup_matches$away_team_goals, "Home Team Win", "Away Team Win")
# Calculate summary statistics
summary(worldcup_matches$home_team_goals)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 1.000 2.000 1.811 3.000 10.000
summary(worldcup_matches$away_team_goals)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 0.000 1.000 1.022 2.000 7.000
From above summary statistics we can determine that maximum number of goals scored by home team are 10 and Away team scored 7.
Cleaning the dataset by replacing the old names of country with current names to map them with number of goals scored.
worldcup_matches = worldcup_matches %>% mutate(home_country=case_when(home_team_name %like% 'China' ~ "China"
, home_team_name=="C�te d'Ivoire" ~ 'Ivory Coast'
, home_team_name %like% "Czech" ~ 'Czechia'
, home_team_name=="England" ~ 'United Kingdom'
, home_team_name=="Scotland" ~ 'United Kingdom'
, home_team_name %like% "German" ~ 'Germany'
, home_team_name %like% "Iran" ~ 'Iran'
, home_team_name %like% "Korea DPR" ~ 'North Korea'
, home_team_name %like% "Korea" ~ 'South Korea'
, home_team_name %like% "Ireland" ~ 'Ireland'
, home_team_name %like% "Serbia" ~ 'Republic of Serbia'
, home_team_name %like% "Yugoslavia" ~ 'Republic of Serbia'
, home_team_name %like% "Soviet Union" ~ 'Russia'
, home_team_name %like% "USA" ~ 'United States of America'
, home_team_name %like% "Wales" ~ 'United Kingdom'
, home_team_name %like% "Zaire" ~ 'Republic of the Congo'
, home_team_name %like% "Bosnia" ~ 'Bosnia and Herzegovina'
, home_team_name %like% "Trinidad and Tobago" ~ "Trinidad and Tobago"
, home_team_name %like% "United Arab Emirates" ~ "United Arab Emirates"
, home_team_name %like% "Dutch East Indies" ~ "Indonesia"
, TRUE ~ home_team_name
)
, away_country=case_when(away_team_name %like% 'China' ~ "China"
, away_team_name=="C�te d'Ivoire" ~ 'Ivory Coast'
, away_team_name %like% "Czech" ~ 'Czechia'
, away_team_name=="England" ~ 'United Kingdom'
, away_team_name=="Scotland" ~ 'United Kingdom'
, away_team_name %like% "German" ~ 'Germany'
, away_team_name %like% "Iran" ~ 'Iran'
, away_team_name %like% "Korea DPR" ~ 'North Korea'
, away_team_name %like% "Korea" ~ 'South Korea'
, away_team_name %like% "Ireland" ~ 'Ireland'
, away_team_name %like% "Serbia" ~ 'Republic of Serbia'
, away_team_name %like% "Yugoslavia" ~ 'Republic of Serbia'
, away_team_name %like% "Soviet Union" ~ 'Russia'
, away_team_name %like% "USA" ~ 'United States of America'
, away_team_name %like% "Wales" ~ 'United Kingdom'
, away_team_name %like% "Zaire" ~ 'Republic of the Congo'
, away_team_name %like% "Bosnia" ~ 'Bosnia and Herzegovina'
, away_team_name %like% "Trinidad and Tobago" ~ "Trinidad and Tobago"
, away_team_name %like% "United Arab Emirates" ~ "United Arab Emirates"
, away_team_name %like% "Dutch East Indies" ~ "Indonesia"
, TRUE ~ away_team_name
)
)
The average number of goals scored by the away team is visualized on a map using the world shapefile data from Natural Earth. The “average_goals” variable calculated from the World Cup matches data by aggregating the home team and away team goals based on the country and marked them on map by grouping with country.
world_shapes <- read_sf("data/ne_110m_admin_0_countries/ne_110m_admin_0_countries.shp")
worldcup_away_goals <- worldcup_matches %>% group_by(away_country) %>%summarise(average_goals = mean(away_team_goals))
away_spatial_data <- world_shapes %>%
left_join(worldcup_away_goals, by = c("ADMIN" = "away_country")) %>%
filter(ISO_A3 != "ATA")
# Plot the map
ggplot(away_spatial_data, aes(group = ADMIN)) +
geom_sf(aes(fill = average_goals)) +
coord_sf(crs = "+proj=robin") +
scale_fill_gradient(low = "blue", high = "red", na.value = "gray", name = "Away Team Goals") +
ggtitle("Average Number of Away Team Goals in World Cup Matches") +
theme_minimal() +
theme(plot.title = element_text(size = 18, face = "bold"),
legend.title = element_text(size = 12, face = "bold"),
legend.text = element_text(size = 10),
legend.position = "bottom",
panel.grid = element_blank(),
axis.text = element_blank(),
axis.title = element_blank())
NA
NA
From the above map we can see that Germany scored highest avaerage number of goals in the worldcup matches followed by Netherlands and Brazil.
worldcup_home_goals <- worldcup_matches %>% group_by(home_country) %>%summarise(average_goals = mean(home_team_goals))
home_spatial_data <- world_shapes %>%
left_join(worldcup_home_goals, by = c("ADMIN" = "home_country")) %>%
filter(ISO_A3 != "ATA")
# Plot the map
ggplot(home_spatial_data, aes(group = ADMIN)) +
geom_sf(aes(fill = average_goals)) +
coord_sf(crs = "+proj=robin") +
scale_fill_gradient(low = "blue", high = "red", na.value = "gray", name = "Home Team Goals") +
labs(title = "Average Number of Goals Scored by Home Teams in World Cup Matches") +
theme_minimal() +
theme(plot.title = element_text(size = 18, face = "bold"),
legend.title = element_text(size = 12, face = "bold"),
legend.text = element_text(size = 10),
legend.position = "bottom",
panel.grid = element_blank(),
axis.text = element_blank(),
axis.title = element_blank())
From the above map we can see that Turkey scored highest avaerage number of goals in the worldcup matches followed by Hungary.
Analyze the number of matches won by Home team and Away team as outcome of the match
# Bar chart of match outcome
b <- ggplot(matches_countries_cleaned, aes(x = outcome, fill = outcome)) +
geom_bar( width = 0.3) +
labs(title = "Match Outcome",
x = "Outcome", y = "Count") +
scale_fill_manual(values = c("Home Team Win" = "steelblue", "Away Team Win" = "salmon"))
ggplotly(b)
The total number of goals scored in each year from 1930 to 2014 is visualized using a bar chart. The chart depicts the trend of goals scored over time.
# Create the bar chart
Goals_plot <- ggplot(matches_countries_cleaned, aes(x = Year, y = Goals)) +
geom_bar(stat = "identity", fill = "steelblue") +
labs(title = "Total Goals Scored Every Year (1930-2014)",
x = "Year", y = "Goals") +
theme_minimal()
ggplotly(Goals_plot)
NA
As we can see from above plot, there are more number of goals scored in the year 2000. The worldcup games has been held every four years, except for the years 1942 and 1946, during World War II. due that we dont see any data for that period in the plot.
Worldcup_goals_by_year <- matches_countries_cleaned %>%
group_by(Year) %>%
mutate(total_goals = sum(Goals))
# Line plot: Number of goals over different years
l <- ggplot(Worldcup_goals_by_year, aes(x = Year, y = total_goals)) +
geom_line(color="blue") +
xlab("Year") + ylab("Number of Goals") +
ggtitle("Number of Goals Over Different Years") +
theme_minimal()
ggplotly(l)
NA
The above plot shows us the number of goals scored in world cup over the years 1930 to 2014.
heart_data <- drop_na(read.csv("data/Heart.csv")[, -1])
str(heart_data)
'data.frame': 297 obs. of 14 variables:
$ Age : int 63 67 67 37 41 56 62 57 63 53 ...
$ Sex : int 1 1 1 1 0 1 0 0 1 1 ...
$ ChestPain: chr "typical" "asymptomatic" "asymptomatic" "nonanginal" ...
$ RestBP : int 145 160 120 130 130 120 140 120 130 140 ...
$ Chol : int 233 286 229 250 204 236 268 354 254 203 ...
$ Fbs : int 1 0 0 0 0 0 0 0 0 1 ...
$ RestECG : int 2 2 2 0 2 0 2 0 2 2 ...
$ MaxHR : int 150 108 129 187 172 178 160 163 147 155 ...
$ ExAng : int 0 1 1 0 0 0 0 1 0 1 ...
$ Oldpeak : num 2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
$ Slope : int 3 2 2 3 1 1 3 1 2 3 ...
$ Ca : int 0 3 2 0 0 0 2 0 1 0 ...
$ Thal : chr "fixed" "normal" "reversable" "normal" ...
$ AHD : chr "No" "Yes" "Yes" "No" ...
The dataset is preprocessed by converting the “AHD” (heart disease diagnosis) column into binary values (0 for “No” and 1 for “Yes”) becasue the target variable is in character format. .
heart_data$AHD<-ifelse(heart_data$AHD=="Yes",1,0)
summary(heart_data)
Age Sex ChestPain RestBP Chol Fbs RestECG MaxHR ExAng Oldpeak Slope
Min. :29.00 Min. :0.0000 Length:297 Min. : 94.0 Min. :126.0 Min. :0.0000 Min. :0.0000 Min. : 71.0 Min. :0.0000 Min. :0.000 Min. :1.000
1st Qu.:48.00 1st Qu.:0.0000 Class :character 1st Qu.:120.0 1st Qu.:211.0 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:133.0 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:1.000
Median :56.00 Median :1.0000 Mode :character Median :130.0 Median :243.0 Median :0.0000 Median :1.0000 Median :153.0 Median :0.0000 Median :0.800 Median :2.000
Mean :54.54 Mean :0.6768 Mean :131.7 Mean :247.4 Mean :0.1448 Mean :0.9966 Mean :149.6 Mean :0.3266 Mean :1.056 Mean :1.603
3rd Qu.:61.00 3rd Qu.:1.0000 3rd Qu.:140.0 3rd Qu.:276.0 3rd Qu.:0.0000 3rd Qu.:2.0000 3rd Qu.:166.0 3rd Qu.:1.0000 3rd Qu.:1.600 3rd Qu.:2.000
Max. :77.00 Max. :1.0000 Max. :200.0 Max. :564.0 Max. :1.0000 Max. :2.0000 Max. :202.0 Max. :1.0000 Max. :6.200 Max. :3.000
Ca Thal AHD
Min. :0.0000 Length:297 Min. :0.0000
1st Qu.:0.0000 Class :character 1st Qu.:0.0000
Median :0.0000 Mode :character Median :0.0000
Mean :0.6768 Mean :0.4613
3rd Qu.:1.0000 3rd Qu.:1.0000
Max. :3.0000 Max. :1.0000
Since the Chestpain and Thal are the catagorical variables creating dummy variables into numeric variable to fit the model using those variables.
mdl_data = dummy_cols(heart_data, select_columns=c( "ChestPain", "Thal"), remove_selected_columns=TRUE)
A logistic regression model is fitted to the heart disease data using the glm() function. The model’s coefficients and summary statistics are displayed. Additionally, a coefficient plot is generated to visualize the coefficients.
mdl = glm(AHD ~., family=binomial(link='logit'), data=mdl_data)
Summary for the model coefficients,
summary(mdl)
Call:
glm(formula = AHD ~ ., family = binomial(link = "logit"), data = mdl_data)
Coefficients: (2 not defined because of singularities)
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.544716 2.978667 -1.526 0.127071
Age -0.012296 0.024664 -0.499 0.618120
Sex 1.431422 0.513185 2.789 0.005282 **
RestBP 0.023981 0.011110 2.159 0.030889 *
Chol 0.004930 0.003944 1.250 0.211306
Fbs -0.610758 0.599184 -1.019 0.308052
RestECG 0.255433 0.189565 1.347 0.177829
MaxHR -0.021281 0.010821 -1.967 0.049224 *
ExAng 0.739431 0.434687 1.701 0.088931 .
Oldpeak 0.353095 0.230102 1.535 0.124903
Slope 0.670508 0.371616 1.804 0.071184 .
Ca 1.269290 0.271304 4.678 2.89e-06 ***
ChestPain_asymptomatic 2.006802 0.652608 3.075 0.002105 **
ChestPain_nonanginal 0.202175 0.648718 0.312 0.755304
ChestPain_nontypical 1.071153 0.753902 1.421 0.155371
ChestPain_typical NA NA NA NA
Thal_fixed -1.429947 0.783279 -1.826 0.067912 .
Thal_normal -1.441377 0.418558 -3.444 0.000574 ***
Thal_reversable NA NA NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 409.95 on 296 degrees of freedom
Residual deviance: 194.83 on 280 degrees of freedom
AIC: 228.83
Number of Fisher Scoring iterations: 6
dwplot(mdl)
From above coefficient plot and summary we can clearly identify Sex, Ca, ChestPain_asymptomic and Thal_normal are the most siginificant varibales for the model prediction.
y_hat = predict(mdl, mdl_data)
pr = prediction(y_hat, mdl_data$AHD)
auc <- performance(pr, measure = "auc")
title = paste('AUC: ', auc@y.values[[1]])
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf, main=title)
The area under the receiver operating characteristic (ROC) curve is calculated to evaluate the model’s performance. The ROC is a metric used to evaluate the performance of a classification model. The ROC curve is created by plotting the True Positive Rate (Sensitivity) against the False Positive Rate (1 - Specificity) at various classification thresholds.
AUC-ROC > 0.5: The model performs better than random guessing. The higher the value, the better the performance.
As we can see this model generated AUC-ROC = 0.933 which is a good prediction.
The project successfully analyzes World Cup matches data and visualizes the average number of goals scored by the home team and away team on maps. Additionally, it performs model fitting and visualization on heart disease diagnostic data, providing insights into heart disease prediction. The visualizations and analyses contribute to a better understanding of the datasets and their patterns.